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I am pleased to participate in this well-deserved 
recognition of Dennis Cook's remarkable career. 

Cook points out Fisher's insistence that predictor 
variables in regression be chosen without reference 
to the dependent variable. Reduction by principal 
components clearly satisfies that dictum. One of my 
primary objections to partial least squares regres- 
sion when I first encountered it as an alternative to 
principal components was that the predictor vari- 
ables were being chosen with reference to the de- 
pendent variable. (I now have other objections to 
partial least squares.) Yet on the other hand, vari- 
able selection in regression is well accepted and it 
clearly chooses variables based on their relationship 
to the dependent variable. Perhaps variable selection 
is better thought of form of shrinkage estima- 
tion rather than as a process for choosing predictor 
variables. 

Cook also reiterates something that I think is dif- 
ficult to overemphasize: Fisher's point that "More 
or less elaborate forms [for models] will be suitable 
according to the volume of the data." We see this 
now on a regular basis as modern technology pro- 
vides larger data sets to which elaborate models are 
regularly fitted. 

With regard to Cook's work, it seems to me that 
the key issue in the development of Cook's models 
(2), (5), (10) and (13) is whether they are broadly 
reasonable. The question did not seem to be exten- 
sively addressed but Cook shows that much can be 
gained if we can reasonably use them. When they 
are appropriate, the results in the corresponding 
propositions are rather stunning. It has long been 
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known that the best regression model available — 
technically the best predictor of a random variable 
y based on a p-dimensional random vector x — is the 
conditional mean E(y\x). The problem with this re- 
sult is that it requires us to know the joint distribu- 
tion of (x',y). Most of what we commonly recognize 
as regression analysis is an attempt to model the 
relationship E(y|x). This includes linear regression, 
nonlinear regression, generalized linear models and 
the various approaches to "nonparametric" (actu- 
ally, highly parametric) regression. Under the mod- 
els being considered, there exists a p x d matrix V 
such that 



y\T'x. 



This means that E(y|x) = E(y|r'x) regardless of what 
modeling strategy we choose to use. If anything, 
this dimensionality reduction from p to d is of more 
importance to nonparametric regression than other 
forms because, as the number of predictor variables 
increases, nonparametric regression gets hit harder 
by the curse of dimensionality than less highly para- 
metric forms. As a result, nonparametric regression 
should benefit most from the existence of a generally 
valid reduction in dimensionality. 

The issue with these four models is to estimate 
the column space of T, say, C(T). In the first six 
sections, the results are all closely tied to the eigen- 
vectors (principal component vectors) of some esti- 
mated covariance matrix for the predictor variables 
x, say Ti. For model (2), the space is spanned by the 
first d principal component vectors of the usual S. 
For model (5), the space is spanned by the first d 
principal component vectors of a restricted version 
of £. For models (10) and (13), the estimation pro- 
cedure is a bit more complicated. The key is that for 
both models (10) and (13) the population covariance 
matrix of x can be written as 

£ = TVDV'T + ToVoDoV^To, 

with D and -Do diagonal matrices, in such a way 
that 

e(tv) = (rv)D, z(r v ) = (r vb)A)- 
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This implies that the eigenvectors of E are either in 
C(r) or in C(Fq) = C(r)- 1 -, the orthogonal comple- 
ment of C(r). The problem is to establish which d 
out of the p orthogonal eigenvectors belong in C(T). 
To estimate C(T), find the orthogonal eigenvectors 
of S, say, v±,...,v p , and check the likelihood of ev- 
ery one of the p choose d combinations that has d 
of the ViS in C(T) and the remaining p — d vectors 
in the orthogonal complement. Whichever combina- 
tion maximizes the likelihood provides the estimate 
of C(r). In case (5) is large, Cook provides a sequen- 
tial selection method. The key difference between 
the procedures for models (10) and (13) is that the 
likelihoods are different. 

The remainder of my discussion is an attempt to 
put the question of estimating the reduced space 
into the context of multivariate linear model theory. 
To do this, I will change Cook's notation completely, 
so that the problem looks more like standard multi- 
variate linear models, but then reidentify the parts 
of the problem that interest Cook. I do not presume 
that any of this is new to Cook, but I found it helpful 
in understanding the process. 

In discussing multivariate linear models, liberal 
use is made of Kronecker products, Vec operators, 
and their properties; see, for example, Christensen 
(2002, Definition B.5 and Subsection B.5). Recall 
also that for a univariate linear model Y = Xf3 + e, 
E(e) = and Cov(e) = V, 

SSE = (Y- XpyV-^Y - X0) 
= Y'V~ 1 (Y - Xj3). 

Moreover, least squares estimates are BLUE's, and 
thus maximum likelihood, if C(VX) C C(X). We 
will apply these facts to the multivariate models. 
Finally, let Jf. denote an r x s matrix of l's with 
J r = J r \ and for a matrix A let Pa = A(A'A)~A' 
be the perpendicular projection operator (ppo) onto 
C(A) with t(A) the rank of A. 

The standard multivariate linear model involves 
dependent variables yi, . . . ,y q . If n observations are 
taken on each dependent variable, we have y^, i = 
1, . . . , n, h = 1, . . . , q. Let Y h = [y lh , y nh ]' and y\ = 
[yn, . . . ,yiq\. For each h, we have a linear model, 

Y h = X/3 h + e h , 
E{e h ) = 0, 
Cov(e h ) = a hh I, 



where X is a known n x p matrix that is the same 
for all dependent variables, but @h and the error vec- 
tor e/j = [eih, • • • , e n h]' are peculiar to the dependent 
variable Yh- 

The multivariate linear model consists of fitting 
the q linear models simultaneously. Letting 



Y n xq = 


[Yi,.. 
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the multivariate linear model is 

(1) Y = XB + e. 

Alternatively, thinking of X as a matrix with rows 
x\ and e as having rows we can write the multi- 
variate linear model as 

y'i = x 'i B + e 'n i = l,...,n. 

To perform maximum likelihood, we assume that 
the Si's are independent N(0, S) random vectors. It 
is reasonably well known that for any ppo Pa, 

(2) E Ylx (Y'P A Y) = r(A)E + B'X'P A XB. 

£ is now being used for the conditional covariance 
matrix oiy\x, whereas Cook used S for the marginal 
covariance matrix of x. 

A generalization of the multivariate linear model 
that is often associated with growth curve models 
(cf. Christensen, 2001) is 

(3) Y = XTZ' + e, 

where the unknown parameter matrix B in (1) is 
replaced by a product of a reduced parameter ma- 
trix r that is p x d and a fixed, known matrix Z 
that is q x d with r(Z) < d < q. This is essentially 
Cook's model (5) when applied to data and using 
drastically different notation. (Our Y is his X, our 
X is his known function of y, F y , our Z is his T, 
etc.) The ultimate goal of our exercise is to drop 
the assumption that we know Z and estimate it, or 
more properly C(Z), from the data. But for now, we 
act as if Z is known. Note that the "growth curve" 
model specifies something akin to a linear model for 
each row of Y , 

yi = Z(Txi) +£», i = l,...,n. 

Moreover, using Kronecker products and Vec oper- 
ators, we can turn the multivariate growth curve 
model (3) into a univariate linear model, 

(4) Vec(y) = [Z(g)X]Vec(r) + Vec(e), 
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with 

Vec(e)~JV(0,[S®Jn]). 

There are a couple of refinements to model (4) 
used in Cook's development. First, the growth curve 
model is specified as 

(5) Y = Jy! + XTZ' + e, 

with J'X = and Z' Z = Id- As a linear model (5) 
becomes 

Vec(Y) = [I q ® J n ]u + [Z ® X] Vec(r) + Vec(e). 

Second is the assumption in Cook's models (2) and 
(5) that 

£ = 

in which case 

Cov[Vec(e)] = a 2 [I q <g> I n ] = a 2 I nq , 

so standard estimation results apply to the model. 
In particular, least squares estimates of the parame- 
ters \i and Vec(r) are maximum likelihood estimates 
and the likelihood function for fixed a 2 evaluated at 
the maximum likelihood estimates of fj, and T is, 
ignoring the constant, 



(6) 



nq 



log(cT 2 ) 



SSE 



2 ~~ ov ~ ' 2a 2 ' 
Performing the usual computations necessary to 
finding least squares estimates, but using properties 
of Kronecker products and Vec operators and ex- 
ploiting the fact that since J'X = we have C([J ? (8) 
J n ]) JL C( [Z <g> X] ) so that estimation of a and T can 
be performed separately, the least squares estimates 
reduce to 

fi = y., f = (x'xyx'YZ(z'z)~, 

or, alternatively, 

XTZ = P X YP Z . 

The maximum likelihood estimate of a 2 is obtained 
by differentiating (6) with respect to a 2 and setting 
it equal to 0, yielding a 2 = SSE /nq. In particular, 
the perpendicular projection operator onto C(\I q ® 
J n ], [Z ® X]) is [I, ® (1/n) J«] + [P z ® Pjf], so 

SSP = Vec(y)'[I g ® (I - (1/n) Vec(Y) 

-Vec(F)'[P z ®P x ]Vec(Y) 

= ||Vec[(/- (1/n) J n ")y] f 

(7) 

-||Vec(P x yPz)|| 2 
= tr[T'(/-(l/n)OT 
-tT[P z Y'P x YPz\. 



Using notation analogous to Cook's, three estima- 
tors that we will use frequently are 

t^-Y'(l--AY, 
n \ n J 

E fit = -Y'PxY, 
n 

t Tes = -y'(i--j%-p x )y. 

n \ n J 

Note that £ is the maximum likelihood estimate of 
the covariance matrix when fitting the usual multi- 
variate one-sample model Y = J n p! + e. Using the 
assumption Z' Z = 1^ so that ZZ' = P z , 

SSE = tr[n£] - tv[Z'nE &t Z]. 

As Cook points out, this depends on C{Z) rather 
than Z itself. 

We are finally in a position to address Cook's 
question, the fact that we do not actually know 
Z . To maximize the likelihood (6) as a function of 
Z we need to maximize Xx\Z' nYi§x i Z\ as a function 
of Z subject to Z' Z = Id- If we think about find- 
ing the columns of Z sequentially, that is, finding 
z\ to maximize z'tj^z subject to ||zi|| 2 = 1, then 
finding Z2 to maximize z'Yj^z subject to ||^2|| 2 = 1 
and z' 2 zi = 0, and so on, this is a standard prob- 
lem in multivariate analysis that is solved by find- 
ing the eigenvectors of £g t relative to the eigenvalues 
Ai > A2 > • • • > Xg > 0. Of course, since Z is q x d, 
we consider only the first d eigenvectors. 

To examine a model equivalent to Cook's model 
(2), we consider the most extreme (largest) choice 
for X, which is C(X) = C(J n ) _L . In this case, Px = 
I n — (l/n)J™ so that £g t = £. It follows that the 
maximum likelihood estimate of Z consists of the 
first d principal component vectors. As pointed out 
by Cook, the number of parameters in our Z matrix 
is pd. However, with this choice of X, p = n — 1, so 
the number of parameters rises with the sample size. 

For Cook's models (10) and (13) in Section 6, 
the covariance structure changes. As indicated ear- 
lier, the estimation methods ultimately involve de- 
termining which of the principal component direc- 
tions are most likely where principal components 
can be computed from some estimate of £, which 
may be any, or preferably all, of £, Efit or £ res - For 
Cook's model (13) we again have 

Vec(Y) = [I q ® J n ]n + [Z (8) X] Vec(r) + Vec(e), 



R. CHRISTENSEN 



but recalling that Z ' Z = 1^, we now incorporate a 
matrix Zq with Z Q Z = and Z' Zq = I q -d an d as- 
sume 

Vec(e) ~ N(0, [Z nlZ + z ^ z ' ® J nD- 

Observe that least squares estimates will still be 
BLUEs and thus maximum likelihood estimates 
because C([Z ^lZ' Q + ZSl 2 Z< ® I n ][Z ® X)) C 
C([Z®X]). 

The S'S'E now involves the perpendicular projec- 
tion operators as in (7), but also involves the in- 
verse of the covariance matrix. With our assump- 
tions about Z and Zq, 

[z nlz' + zn' 2 z'®i n }- 1 
= [z nfz' Q + zn~ 2 z' si n ]. 

The SSE becomes 

SSE = Yec(Y)'[(Zn- 2 Z' + Z^ 2 Z' Q ) 

®(I-(l/n)J%)]Vec(Y) 

- Vec{Y)'[Zn~ 2 Z' ® P x ] Vec(Y) 
= Vec(Y)'Vec[(/-(l/n)J") 

• Y(zn- 2 z' + z % 2 z' )] 

-Vec{Y)'Vec(PxYZn~ 2 Z') 
= tv[Y'(I - (l/n)JZ)Y(Zn~ 2 Z' + Z % 2 Z' )} 

- tr(Y'P x YZQ- 2 Z') 

= tr[n^Z' Y'(I-(l/n)JZ)YZ ) 

+ \x{Q- 2 Z'[Y'{I - (l/n)J%)Y - Y'P X Y]Z} 
= tr[n Q 2 Z' ntZ ] + tr{n~ 2 Z'[n£ - nt Rt ]Z}. 
The likelihood will be 

^ io g (\z n 2 z' + zn 2 z'\) + ^-sse 



-n 



2 

+ ■ 



iog(l«g| 



-n 



+ — tr[fio 2 ^SZ ] 



; log(|^ 2 |) + — tr(irV[S - E fit ]Z), 

which, maximizing over and f2o, Cook indicates 
reduces to 

^ log(|^SZo|) + ^ log(|Z'[E - E fit ]Z|). 

As before, Cook's model (10) can be viewed as the 
special case where C(X) = C(J n ) , so that the sec- 
ond term in the likelihood disappears. 

To continue this discussion, we need to leave the 
conditional world of linear models and consider the 



unconditional expected values of E, Efit and E res . 
Conditionally, applying (2) to model (5) when J' A = 
gives 

(8) E Ylx (Y'P A Y) = r(A)E + ZT'X'P A XTZ'. 

For E and Eg t the appropriate ppo has PaX = X 
and for E res the ppo has PaX = 0. We have assumed 
that J'X = 0, which is only reasonable if the ran- 
dom rows of X have been adjusted by their sample 
means; nonetheless, it is reasonable to define the 
marginal covariance matrix of a row of X as 



V x 



1 



n 



1 



Ex(X'X). 



These results quickly yield the expectations 

n — 1 _ n — 1 
-E + — 



E(S) 
E(S fit ) 
E(E res ) 



n 

r(X) 

n 
n — 1 



-zrV x rz', 



E + 



n 



1 



n 



-ZT'V x TZ', 



ii 



In particular, with E = ZqQqZq + Zfl 2 Z', Cook's 
Proposition 4 says that the estimates converge in 
probability to the limits of their expected values. 

Cook's second simulation has a true model with 
d=l,n = 250, q=W (his p), p = 1, T = 1, Q 
Qq = aolg and V x = a 2 (his ay). Here, 

±a 2 Z Z' + —(a 2 + a 2 x )ZZ', 



cr, 



E(E) 
E(E fit ) 

E(E rcs ) 



n 
v(X) 



n 



n 



(TqZqZq 



+ 



n 



r(X) 



n 



n 

l-r(X) 



ZZ' 



+ 



n 

n—1 



a o z o z o 



n 



Cook's simulation results make good sense in terms 
of these expected values. Terms involving i{X)/n 
should be unimportant. When do is small, E(E) is 
dominated by {a 2 + a 2 )ZZ', which is larger than 
the corresponding terms a 2 ZZ' and a 2 ZZ' for Eg t 
and E res , respectively, so it works best. When ctq is 
comparable to a and a x , Eg t works well, because 
E(Efi t ) is much less affected by Oq than the other 
estimates. And when <jq is large, E ros works well 
because then we need to look at the eigenvectors for 
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small eigenvalues of E rcs and E, but the term a 2 ZZ' 
for E res is smaller than the term {p 2 + cr^ZZ 1 for 
E, whereas the expected value of Eg t is relatively 
unaffected by Oq getting large. As Cook mentions, 
when a 2 + a 2 = <Tq , there is very little ability for E to 
identify C(Z) because then E(E) = (n — \)oQ/nI q , so 
we cannot really expect the eigenvectors of E to help 
us identify C(Z). Similarly, when <r 2 = <7q, E(E res ) = 
( n -l-r(X)) ff0 2 /n7 9 . 

Cook's models (2), (5), (10) and (13) involve spe- 
cialized structure for E. His model (16) allows for 
general E. Nonetheless, the expectations of the esti- 
mates show that there should almost always be some 
ability to estimate C(Z). In particular, 



E 



n 



r(X) 



Efit 



n 



n 



1 



n-l-r(X) 



ZT'V X TZ' 



so the first d principal component vectors of the es- 
timate ^p^yEfi t — ^3jE should be at least a reason- 
able estimate of a basis for C(Z). For large sam- 
ples this is similar to looking at the directions de- 
termined by Efi t , but in the extreme case oiC(X) = 
C(J n ) ± , the estimator is degenerate at 0. Of course, 
according to Cook's Proposition 6, for the general 
E (Cook's a 2 A) of model (16), it no longer suffices 
to estimate C(Z); we need to estimate C{TT 1 Z). 
Fortunately, E rcs provides an estimate of E, so we 
can just transform the estimated basis for C(Z). For 
large n, it makes sense to base estimation of C(Z) on 
Efit, but rather than transforming its eigenvectors, 
one could alternatively look directly at the eigen- 
vectors of E~gEg t , which is Cook's recommendation 
when p = d. This intuitive approach based on the 
expected values of (matrix) quadratic forms seems 
analogous to using Henderson's method 3 for esti- 
mating variance components, whereas Cook is rec- 
ommending a maximum likelihood procedure, which 
I suspect is better. 

I found the relationship between model (16) and 
ordinary least squares (OLS), discussed in Section 7.4, 
fascinating. It seems that the gains to be had over 
OLS demonstrated in the simulations are due to im- 
posing alternative structure on the inverse regres- 
sion relationship. I am comforted by the idea that 



using OLS is not bad but rather, if we can find an ap- 
propriate model with more structure, we can do bet- 
ter than OLS. Of course, this only applies when the 
sufficient reduction is one-dimensional. With more 
dimensions, we need our full range of regression tools 
to develop a relationship between the dependent vari- 
able and the reduced predictor variables. 

I initially found Cook's discussion of standardiza- 
tion in Section 7.3 disturbing. I am not dogmatic 
about the need to standardize variables prior to find- 
ing principal components. When the measurements 
are all on similar scales, using the original scales 
seems reasonable to me, as when measuring the height, 
length and width of turtle shells in centimeters. On 
the other hand, if I measure length in kilometers 
and height and width in millimeters, the first prin- 
cipal component will essentially ignore the lengths, 
regardless of any role that length might play in pre- 
diction. I suspect that one point of Cook's discus- 
sion is that in a situation where you need to stan- 
dardize the variables, there will be little reason to 
suppose that his models (2) or (5) are appropriate, 
which means there is little reason to use principal 
components. More generally, his point seems to be 
that standardization is necessary but that a more 
complete standardization than merely rescaling the 
variables is needed. 

As I indicated at the beginning of my discussion, 
my biggest problem with these procedures is that I 
do not have a good feel for when the models (2), (5), 
(10) and (13) will be appropriate. Multivariate lin- 
ear model theory should allow us to use E res to test 
the assumption of Cook's models (2) and (5) that 
E = a 2 I. I am less sure if it will provide a test of 
whether E = a 2 ZZ' + (TqZqZq, when Z is unknown, 
but a generalized likelihood ratio test seems plausi- 
ble. In any case, the procedures in Section 7.2 seem 
generally applicable. 
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